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Abstract. In recommendation systems, one is interested in the ranking of 
the predicted items as opposed to other losses such as the mean squared error. 
Although a variety of ways to evaluate rankings exist in the literature, here 
we focus on the Area Under the ROC Curve (AUC) as it widely used and 
has a strong theoretical underpinning. In practical recommendation, only 
items at the top of the ranked list are presented to the users. With this 
in mind, we propose a class of objective functions over matrix factorisations 
which primarily represent a smooth surrogate for the real AUC, and in a 
special case we show how to prioritise the top of the list. The objectives are 
differentiable and optimised through a carefully designed stochastic gradient- 
descent-based algorithm which scales linearly with the size of the data. In the 
special case of square loss we show how to improve computational complexity 
by leveraging previously computed measures. To understand theoretically the 
underlying matrix factorisation approaches we study both the consistency of 
the loss functions with respect to AUC, and generalisation using Rademacher 
theory. The resulting generalisation analysis gives strong motivation for the 
optimisation under study. Finally, we provide computation results as to the 
efficacy of the proposed method using synthetic and real data. 


1. Introduction 

A recommendation system Dlls] takes a set of items that a user has rated and 
recommends new items that the user may like in the future. Such systems have 
a broad range of applications such as recommending books [TJ, CDs [17], movies 
[27l[3] and news |7]. To formalise the recommendation problem let {wi,..., Wm} C 
W be the set of all users and let {yi,..., i/„} C V be the items that can be rec¬ 
ommended. Each user Wi rates item yj with a value which measures whether 
item j/j is useful to Wi. In this work, we consider the implicit recommendation 
problem in which G 72. = {—1,-|-1}. For example, the rating value represents 
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whether a person has read a particular research paper, or bought a product. We 
are thus presented with a set of observations {{wi,yj,rij) : Wi S W,?/j G 3^} as a 
training sample. The aim of recommendation systems is to find a scoring function 
/ : W X 3^ i-A M such that the score f{w,y) is high when user w strongly prefers 
item y. This problem is challenging for a number of reasons: we are interested only 
in the top few items for each user, there is often a large fraction of missing obser¬ 
vations (irrelevant items are generally unknown) and the sample of rated items is 
often drawn from a non-uniform distribution. 

To address these issues, we propose a general framework for obtaining strong 
ranking of items based on the theoretically well studied local Area Under the ROC 
Curve (AUC) metric. The framework relies on matrix factorisation to represent 
parameters, which generally performs well and scales to large numbers of users and 
items. One essentially finds a low rank approximation of the unobserved entries 
according to an AUC-based objective and several different surrogate loss functions. 
In addition we show how to focus the optimisation to the items placed at the top of 
the list. The resulting methods have smooth differentiable objective functions and 
can be solved using stochastic gradient descent. We additionally show how to per¬ 
form the optimisation in parallel so it can make effective use of modern multicore 
CPUs. The novel algorithms are studied empirically in relation to other state-of- 
the-art matrix factorisation methods on a selection of synthetic and real datasets. 
We also answer some theoretical questions about the proposed methods. The first 
question is whether optimising the surrogate functions will result in improving the 
AUC. The second question represents a generalisation analysis of the matrix fac¬ 
torisation approaches using Rademacher Theory [2] , a data-dependent approach to 
obtaining error bounds. The analysis sheds some light onto whether the quantities 
optimised on a training sample will generalise to unseen observations and provides 
a bound between these values. Note that a preliminary version of this paper has 
been presented in [9] and here we extend the work by considering a much larger 
class of objectives, the theoretical study and further empirical analysis. 

This paper is organised as follows. In the next section we motivate and derive 
the Matrix Factorisation with AUC (MFAUC) framework and present its special¬ 
isation according to a variety of objective functions. In Section [3] we present the 
theoretical study on consistency and generalisation of the proposed approaches. 
Following, there is a review on related work on matrix factorisation methods in¬ 
cluding those based on top-.^ rank-based losses. The MFAUC algorithm is then 
evaluated empirically in Section [S] and finally we conclude with a summary and 
some perspectives. 

Notation: A bold uppercase letter represents a matrix, e.g. X, and a column 
vector is displayed using a bold lowercase letter, e.g. x. The transpose of a matrix 
or vector is written X^. The indicator function is given by /(•) so that it is I 
when its input is true otherwise it is . The all ones vector is denoted by j and the 
corresponding matrix is J. 


2. Maximising AUC 

The Area Under the ROC Curve is a common way to evaluate rankings. Consider 
user w G W then the AUC is the expectation that a uniformly drawn positive item 
y is greater with respect to a negative item y' under a scoring function s : 3^ i—>■ K. 

( 1 ) AUC^,(s)=E^,[I(s(y)>s(z/0)], 
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where 2? is a distribution over items for w, X is the indicator function that is 1 when 
its input is true and otherwise 0, and we assume scores are never equal. One cannot 
in general maximise AUC directly since the indicator function is non-smooth and 
non-differentiable and hence difficult to optimise. Our observations consist only 
of positive relevance for items, thus we maximise a quantity closely related to the 
AUC for all users, which is practical to optimise. The missing observations are 
considered as negative as in [SlES] for example. 

Accuracy experienced by users is closely related to performance on complete 
data rather than available data [25] and thus the distribution T) is an important 
consideration in a practical recommendation system. This implies that a non- 
uniform sampling of items might be beneficial. Consider a user w and a rating 
function s, then the empirical AUC for this user is: 





where uj is the set of indices of relevant items for user w, ui is the complement of w, 
and g and g' are distributions on the relevant and irrelevant items respectively. The 
most intuitive choices for g and g' are the uniform distributions. On real datasets 
however, item ratings often have a long tail distribution so that a small number 
of items represent large proportions of the total number of ratings. This opens up 
the question as to whether the so-called popularity bias might be an advantage to 
recommender systems that predict using the same distribution and we return to 
this point later. 

The AUC certainly captures a great deal of information about the ranking of 
items for each user, however as pointed out by previous authors, in practice one 
is only interested in the top items in that ranking. Two scoring functions s and 
s' could have identical AUC value and yet s for example scores badly on the top 
few items but recovers lower down the list. One way of measuring this behaviour 
is through local AUC [5] which is the expectation that a positive item is scored 
higher than a negative one, and the positive one is in the top tth quantile. This is 
equivalent to saying that we ignore positive items low down in the ranking. 


2.1. A General Framework. As previously stated, direct maximisation of AUC 
is not practical and hence we use a selection of surrogate functions to approximate 
the indicator function. Here a scoring function is found for all users and items, in 
particular the score for the *th user and jth item is given by s{wi, yj) = (UV^)ij 
where U G is a matrix of m user factors and V G is a matrix of n item 

factors. We will refer to the Ah rows of these matrices as and respectively. 
Furthermore, let X G be a matrix of ratings such that X^- = -|-1 if user Wi 

finds item yj relevant otherwise Xy = 0 if the item is missing or irrelevant. In a 
sense which is made clear later, X is an approximation of the complete structure 
of learner scores so that X r; UV^. The advantage of this matrix factorisation 
strategy is that the scoring function has k{m + n) parameters and hence scales 
linearly with both the number of users and items. 
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The framework we propose is composed principally of solving the following gen¬ 
eral optimisation: 


( 3 ) 




i—l p^oji 




Vl||U||| + l||V| 
Z \m n 


for a user-defined regularisation parameter A, loss function L, rank weighting func¬ 
tion (j), item difference = ujvp — ujvq and distributions for relevant and 

irrelevant items g and g'. The relevant items for the ith user are given by the 
set oji and irrelevant/missing items are indexed in uJi. The first sum in the first 
term averages over users and the second sum computes ranking losses over positive 
items for the ith user. The second term is a penalisation on the Frobenius norms 
(||A|||. = Afj) of the user and item factor matrices. Concrete item distribu¬ 
tions g and g’ are discussed later in this section but a simple case is using the 
uniform distributions l/|wi| and l/|wi| respectively. 

The loss function L can be specialised to one of the following options (^ > 0 is 
a user-defined parameter): 


L{x) = i max(0,1 — 

L{x) = i(l — xY 
lIx) = -l/{l + e-P^) 

L{x) = — ln(l/(l -I- e~^^)) 


square hinge 
square 
sigmoid 
logistic 


See Figure[T]for a graphical representation of each of these loss functions. On binary 
classification the square hinge loss is show to provide a strong AUC on both training 
and test data in |29) . Furthermore, the objective becomes convex in U and V when 
4>{x) = X. The squared loss is shown to be consistent with the AUC [TT] and the 
squared hinge loss is shown to be consistent in [12]. In contrast, the hinge loss is 
not consistent with AUC. The sigmoid function is perhaps the best approximation 
of the negative indicator function. As /3 —>■ oo it approaches the indicator function, 
and hence we get an objective exactly corresponding to maximising AUC. The 
sigmoid function is used in conjunction with AUC for bipartite ranking in El- 
An noted in this paper, if /3 is too small then corresponding objective is a poor 
approximation of the AUC and if it is too large then the objective approaches the 
sum of step functions making it hard to perform gradient descent. To alleviate the 
problem the training data is used to to choose a series of increasing /3 values. 

For the weighting function (j){x) = x, one can see that the first term in the 
objective follows directly from AUC (Equation[2|) by replacing the indicator function 
with the appropriate loss. The optimisation in this case is convex in the square and 
hinge loss cases for U and V but not both simultaneously. This form of the objective 
however, does not take into account the preference for ranking items correctly at 
the top of the list. Consider instead (j){x) = tanh(px) for x > 0 and fixed p > 0, 
which is a concave function. The term inside (j) in Optimisation [3| represents a 
ranking loss for ith user and pth item and thus yp is high up in the ranked list if 
this quantity is small. The effect of choosing the hyperbolic tangent is that items 
with small losses have larger gradients towards the optimal solution and hence are 
prioritised. 

The distributions on the items allow some flexibility into the expectation we 
ultimately compute. Although the obvious choice is a uniform distribution, in 
practice relevant item distributions often follow the so-called power law given by 
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Figure 1. Plot of the loss functions L with /3 = 5. 


p{y) oc riy'^ for some exponent r > 1, where Uy is the number of times relevant item 
y occurs in the complete (fully observed) data. In words, the power law says that 
there are a few items which are rated very frequently whereas most items are not 
frequently rated, corresponding to a bias on observing a rating for popular items. 
However, recommendations for less well-known items are considered valuable for 
users. Since we have incomplete data the generating distribution can be modelled 
in a similar way to that of the complete data (see e.g. |26] for further motivation), 

g{y) ocp(y)'"'. 

where p{y) is the empirical probability of observing y and r' > 0 is a exponent used 
to control the weight of items in the objective. The irrelevant and missing items 
form the complement of the relevant ones and hence we have 

g'{y) oc q{yy' = {1 - p{y)y'. 

Notice that when t' = 0 we sample from the uniform distribution. Since we expect 
p{y) to be related to Uy with a power law this implies that when r' > 0 we give 
preference to items for which Uy is small and hence focus on less popular items. 


2.2. Optimisation Algorithms. To solve the above objectives one can use gra¬ 
dient descent methods which rely on computing the derivatives with respect to the 
parameters U and V. Here we present the derivatives for choices of loss L and 
weighting function ij), starting with the squared hinge loss and ^(a;) = x. Denote 
the objective function of Optimisation |3] as 9 then the derivatives are. 


( 4 ) 

and 

( 5 ) 


60 1 
i5u. 


" XI diVp) ( X “ ^p)Kl^,P,q)9'{yq) ) + 

lit flit 

p^iOi \q^iOi / 


66 1 ^ I 

T— =—X"* X 9{yp)h{l^,p,]) 


(5v,- m 

i—1 




-^jeujigiyj) 9'{yq)h{'yi.j,q) j + 

qeOi / 


A 
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where h{x) = max(0 ,1 — x) and for convenience we use the notation = I{j G 

uji). The squared loss is identical except that h{x) = (1 — x) in this case. It is 
worth noting that the term inside the outer sum of this derivative in the squared 
loss case can be written as: 


( 6 ) 


56 

Svi 


=— > u, 

m 


+ uf (vj- - vO) 


+ uf (v* - Vj))) + 

where v* = ff(yp)vp and = Y.qecui d'{yq)^q are empirical expectations. 

The derivative with respect to can be treated in a similar way: 


( 7 ) 




— (Vi - Vi + Wj - Vivfui - Vivfui + Wi) H-Ui, 

m m 


where Wi = and w* = J2q(^a,i9'iyq)^<i^q^^- Thus we have 

inexpensive ways to compute derivatives in the square loss case provided we have 
access to the expectations of particular vectors. 

The logistic and sigmoid losses are similar functions and their derivatives are 
computed as follows (/3 is a user-defined parameter and (fi^x) = x as before): 


( 8 ) 


and 


(9) 


dui 


pGOJi 



^p)h{'y^,p,q 



A 

+ -Ui, 

n 


59 


^ 9{yp)hh^,p,j) 


pGUJt 


Tj'ewi aiVj) X! 9'{yq)h{li,j,q) 1 + 


in which h{x) = for the logistic loss and h{x) = for the sigmoid 

loss. 

We now consider the case in which we have the weighting function (/)(x) = 
tanh(px) on the item losses in conjunction with a square or squared hinge loss. 
The gradients with respect to the objective 9 are 


* pGi^i \qeCJt / 

^1 - tanh^ Hlt,p,qf9'iyq)J^ + ^Ui, 

and the corresponding gradient with respect to Vj is 

4 - = — (^jeco.g'iVj) Y 9iyp)Hli,P,j) • -tanh^ ( ^ 9'(yq) 


5vj m 

1 — 1 




q^LOi 


-^jecoi9{yj) Y 9'{yq)Kli,3,q) ■ - tanh^ Y j ^ 
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with h{x) = inax(0, 1 — x) in the square hinge loss case and h{x) = {1 — x) for the 
square loss. When we compare the above derivatives to Equations |4] and [5] for the 
square/hinge loss functions we can see that there is an additional weighting on the 

relevant items given by /(u^, Vp) = 1 - tanh^ “ ^J^qfg'iVq)) 

which will naturally increase the size of gradient for items with small losses and 
hence those high up in the ranking. 

Thus we have presented the derivatives for each of the loss functions we proposed 
earlier in this section as well the hyperbolic tangent weighting scheme. The deriva¬ 
tive with respect to Vj is generally the most costly to find and we can see that the 
computational complexity of computing it is 0(rnn) and hence the derivative with 
respect to the complete matrix V is 0{mn'^). The complexity of the derivative 
with respect to U is identical although in practice it is quicker to compute. Fortu¬ 
nately, the expectations for both derivatives can be estimated effectively using Kyy 
users, and Ky items from oji and uii. This reduces the complexity per iteration of 
computing the derivative with respect to V to 0{K\yKy). 

Using these approximate derivatives we can apply stochastic gradient descent to 
solve Optimisation 131 Define 0'(U,V) as the approximate subsampled objective, 
then one then walks in the negative direction of the approximate derivatives using 
average stochastic gradient descent (average SGD, [22]). Algorithm [1] shows the 
pseudo code for solving the optimisation. Given initial values of U and V (using 
random Gaussian elements, for example) we define an iteration as max(m, n) gradi¬ 
ent steps using a random permutation of indices i € [1, m] and j G [1, nj. It follows 
that each iteration updates all of the rows of U and V at least once according to 
the learning rate a G R+ and the corresponding derivative. Note in line [6| that if 
t exceeds the size of the corresponding vector we wrap around to the start. Under 
average SGD one maintains, during the optimisation, matrices U and V which are 
weighted averages of the user and item factors. These matrices can be shown to 
converge more rapidly than U and V, see [22] for further details. The algorithm 
concludes when the maximum number of iterations T has been reached or conver¬ 
gence is achieved by looking at the average objective value for the the most recent 
iterations. In the following text we refer to this algorithm as Matrix Factorisation 
with AUC (MFAUC). 


2.2.1. Parallel Algorithm. A disadvantage of the optimisation just described is that 
it cannot easily be implemented to make use of model parallel computer architec¬ 
tures as each intermediate solution depends on the previous one. To compensate, 
we build upon the Distributed SGD (DSDG) algorithm of [ 13 ]. The algorithm 
works on the assumption that the loss function we are optimising can be split up 
into a sum of local losses, each local loss working on a subset of the elements of . 
The algorithm proceeds by partitioning X into di x d 2 fixed blocks (sub-matrices) 
and each block is processed by a separate process/thread ensuring that no two 
concurrent blocks share the same row or column. This constraint is required so 
that updates to the rows of U and V are independent, see Figure |21 The blocks 
do not have to be contiguous or of uniform size, and in our implementation blocks 
are sized so that the number of nonzero elements within each one is approximately 
constant. The algorithm terminates after every block has been processed at least 
T times. 
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Algorithm 1 Pseudo code for Matrix Factorisation with AUC 


Require: Ratings X, solutions U, V, iterations T, average start Tq, learning rate 
a, convergence threshold e 


1 

2 

3 

4 

5 

6 

7 

8 


U = U and V = V 
Vector z is permutation of m} and z' is permutation of {1,..., n} 

0' = 0'(U,V), 0;=0'+e, s = l 
while s and |0' — 0'_i| > e do 
for t = 1,..., max(m, n) do 
Let i = Zi and j = zj 
Ui ^ Ui — a-^ and Vj <— Vj — cn-^ 
if r > To then 


9 

10 

11 

12 

13 

14 

15 

16 


u, ^ and Vj ^ j^Vj + 

else 




Ui = Ui and Vj = Vj 


end if 
end for 

Update 0' = 0'(U, V), s = s + 1 

end while 

return Solutions tJ, V 



Figure 2. Illustration of the principal behind DSGD. Each of the 
two large squares represents a matrix divided into blocks, and each 
black square represents a process working on a block. 


For AUC-based losses we require a pairwise comparison of items for each row and 
therefore the loss cannot easily be split into local losses. To get around this issue 
we modify DSGD as follows: at each iteration we randomly assign rows/columns 
to blocks, i.e. blocks are no longer fixed throughout the algorithm. We then 
concurrently compute empirical estimates of the loss above within all blocks by 
restricting the corresponding positive and negative items, and repeat this process 
until convergence or for T iterations. 

3. Consistency and Generalisation 

In this section we will discuss issues relating to the consistency of the above 
optimisation, as well as the generalisation performance on unseen data. First we 
address the question of whether optimising a surrogate of the AUG will lead to 
improving the AUC itself. We draw upon the work of [T^] to show that our chosen 
loss functions are consistent. In the bipartite ranking case the square hinge loss is 
shown to be consistent in [T^ and a similar result for the square loss is proven in 
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Therefore we concentrate on the sigmoid and logistic functions and additionally 
show how the consistency results apply to the matrix factorisation scenario we 
consider in this paper. 

To begin we define consistency in a more formal sense, starting with a more 
general definition of the AUC than Equation [T] In this case, the scoring function s 
can result in the same scores for different items, 

AUCi 5 (s) = - r')(s(y) - s{y')) > 0) + = s{y'))\r ^ r% 

where r,r' G {—1, +1} are the labels respectively of items y, y'. We can alternatively 
state the AUC in terms of a risk so that maximising AUC corresponds to minimising 
this quantity: 

R{s) = E(^y^r),{y>,r')^v[Ks, (Z/U), 1 "'] 

in which £(s, (y, r), {y', r')) = I((r - r'){s{y) - s(y')) < 0) + ^2:(s(y) = s(y')). If we 
define r/(y) = P[r = +l|j/] then we see that the risk can be expressed as 

R{s) oc E(y [77(j/)(l - r]{y'))e'{s,y,y') +r]{y'){l - T](y))e'{s,y',y)], 

where £'{s,y,y') = R{s{y) — s{y') < 0) + ^I{s{y) = s{y'j) and Vy is the marginal 
distribution over y. Notice that if we assume that r](jj) > rjiy') then we would 
prefer a function s such that s{y) > s{y'), and a similar results holds if we swap 
y and y'. This allows us to introduce the Bayes risk R{s*) where s* is defined as 
follows: 

s* = arginfgi?(s) 

= {s ■■ {s{y) - s{y')){T]{y) - rj{y')) > 0 if rj{y) T]{y')}, 

where s is chosen from all measurable functions. Since we replace the indicator with 
a surrogate loss function, define T'(s, y, y') = L{s{y) — s{y')) then we can write the 
corresponding risk as, 

i?L(s) oc E(j^ [r]{y){l - r]{y'))L'{s, y, y') + 'n{y'){l - r]{y))L'{s, y', y)], 

and the optimal scoring function with respect to this risk is s\ = arginfgi?i(s). 
With these definitions, we can define consistency. 

Definition 3.1 (Consistency, [12)1. The surrogate loss L is said to be consistent 
with AUC if for every sequence following holds over all distribu¬ 

tions V on y X TZ: 

^ Rl{sI) then ^ 

Thus, a consistent loss function will lead to an optimal Bayes risk in the limit 
of an infinite sample size. Furthermore, a sufficient condition is given for AUC as 
follows 

Theorem 3.1 (Sufficiency for AUC consistency, [E]). The surrogate loss L'{s, y, y') 
L{s{y) — s(y')) is consistent with AUC if L ^ M. is a convex, differentiable and 
non-increasing function with AL(0) < 0. 

These theorems allow us to prove that the sigmoid and logistic losses are consis¬ 
tent with AUC. 

Theorem 3.2. Both the sigmoid —a{x), <j{x) = 1/(1 -I- and logistic loss 

— ln((T(a;)) = ln(l/(l -|- e~^^)) functions are consistent with AUC. 
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Proof. We will start with the logistic function. Assume that /3 > 0 and x is finite, 

J - ln(a) _ 

5x (1 + e~P^) 

which implies a non-increasing function, in particular the derivative at zero is —j312. 
In addition the logistic loss is convex since for all finite x, 

(5^ — In(cr) 

Sx"^ (1 -I- e~P^) \ (1 -I- e~P^) 

where the first term in the derivative is greater than zero and the second term in 
parenthesis is between 0 and 1. An application of Theorem 13.11 gives the required 
result. 

For the sigmoid function, consider a set of items S = {yi,y 2 , ■ ■ ■, Vn} with cor¬ 
responding conditional probabilities on being positive ly(yi), 17 ( 2 / 2 ), • ■ •, viym) then 
we can find the following risk (we use notation iji = r]{yi) and Si = s{yi) for conve¬ 
nience) 



Rl{s) oc Rfis) = - ^ f ??z(l - Vj) ^ ~ 

i<j 




For each pair of terms in this sum notice first that 1/(1 -|- e““) + 1/(1 + e“) = 1 
for all X. It follows that if r]i > rjj then the first term should be maximised by 
increasing Si — Sj otherwise the second one should be maximised to minimise the 
risk. Therefore, in the limiting case 

^l(s) 

m>Vj 

and we have (s^ — Sj){rji > rjj) >0 when rji ^ rjj as in the Bayes risk. □ 


As we have not made any assumption on scoring functions in our previous results, 
the AUC is simply generalised to the expectation over users as follows 

AUCi,(s) = Ewxv[2:((t - r'){s{w,y) - s{w,y')) > 0) + ^I{s{y) = s{y'))\r ^ r')], 

where T) is now is a distribution over users and items and we redefine the scoring 
function as s : W x 1 —,> ffi.. Thus we have shown that all the loss functions we 
consider are consistent with the AUC in the multi-user scenario. 


3.1. Generalisation Bound. We now turn to another critical question about our 
algorithm relating to the generalisation. In particular we look at Rademacher 
theory, which is a data-dependent way of studying generalisation performance. 
We assume that the observations are sampled from a distribution V over W x 
y X TZ. Recall that only positive ratings are observed, hence the sample S = 
U™ i{(rci, yujii), ■ ■ ■, {wi, Vuiir,.)} is drawn from the distribution V = x • • • x 
where we use the shorthand rii = \uJi\. Now consider a class of functions Q mapping 
from W X 3^^ to [0,1] then the AUC can be maximised by minimising 


( 10 ) 


Es[Q] = -E 

m “ 


^-7 Y ] Y Q{wi,yp,yq), 

riin'- “ 

* p^iOi q^uJi 


in the case that Q = {Q : {w,y,y') I{s{w,y) < s{w,y')),s G S}. Likewise 
we can substitute any class of loss functions Q to be minimised, for example the 
surrogate functions defined above. Noting this, we can now present the idea of a 
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Rademacher variables which are uniformly distributed { — 1, +1} independent ran¬ 
dom variables. The following definition is an empirical Rademacher average for all 
users 


^A[/C(g) ^ 2E^ 


m 


sup — 
QGQ ^ 


E 


V,. 


, mn'^ “ 


Q{wi,yp,yq) 


where the Pi’s are independent Rademacher random variables (note the close con¬ 
nection to Equation 1101) . This expectation is an empirical measure of the capacity 
of a function class because it measures how well functions in that class can correlate 
with random data. We define the Rademacher complexity of Q as 




where S is drawn over 2?"^ x • • • x . In the following lemma we provide a 
relationship between these two quantities using McDiarmid’s Theorem. 

Theorem 3.3 (McDiarmid, |19]i. Let Xi ,..., Xn be independent random variables 
taking values in a set A and assume / : —>■ R satisfies 


sup . . ,X„) - f{xi, . . . . . ,X„)| < Ci, 


for all i = ... ,n. Then for all e > 0, 


P{f{Xi,X„) - E[f{Xi ,..., X„)\ > e} < exp f ^ 

\ 2^1=1 

and 

P{E[f{Xi, ..., Xn)] - f{Xi,Xn) > e} < exp f ^ 

\ 2^1—1 

We show that the Rademacher complexities are related as follows 

Theorem 3.4. Let Q be a function class mapping from Yd xy^ to [0,1] and define 
sample S = U(Ti{(iTi, ),■•■, (iTi, )} drawn from distribution x ••• x 

Pfp ■ Then with probability at least 1 — <5 the following is true 


R 


:AUC 


{Q)<R^^^{Q) + 


\ 


21n(l/(5)(n- 1)2 


y_L_ 


Proof. As well as S consider another sample Sab = S \ {{wa,yb)} U {{wa,yb')} for 
a € [1, m] and b, b' G [1, u]. For notational convenience we call the nonzero elements 
of Sab, w' = uii, i ^ a and ui'a = uji\ {yb} U {yb'}- Likewise for the zero elements 
uj] = uji, i ^ a and d;(, = Wi \ {yb'} U {yb}. Define 


1 

sup — y r“ 

QeC m \u}i 


V,. 




|a;i| 

SEW 

P=1 9=1 


Wi,yu,i^,ya. 


f{S) = 2E, 
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Therefore we have that {ml2) ■ {f{S) — f{Sab)) is equal to (the notation ujij denotes 
the jth element in oJi) 


= El. 


< Ei. 


= E,. 


= El/ 


m ^ \^i\ \^i\ m ^ \^i\ \^i\ 

/ m |t^i| l^il m \^i\ \^i\ 

SS, S irki S g ®‘”" S RRi S g ®‘”" ’ 


' p=i q= 
|£*Ja| l^^al 


' p=i q= 

l^'a I l^al 


sup 1 ]-gr-r gg gg Q{Wa,yu:^p,yu^ag) - I ,g-,| gg gg 


II II- I 

QeS V l^^alllVal 


E {Q{wa, Vb^Vq) - Q{Wa,yb' > Vq)) 




+ E {Qiwa,yp,yb') - Qiwa,yp,yb)) + Qiwa,yb,yb') - Q{wa,yb',yb) 

ypet^a.\{_Vb} 
l^al -t |Oa| — 1 
|uJa||Oa| 
n — 1 


|uJa||uJa| 


A similar derivation can be shown to bound m{f{Sab) — fiS)) using an identical 
term. If we let Cab = m\ip~\\^a \ then we can apply Theorem 13.31 and write E[/(5')] < 
f{S) + € with a probability greater than 1 — (5 for some <5 G [0,1] where 


S = exp 


-2e^ 


^ A^i—1 Z-^j=l ^ij ^ 


and rearranging gives e = ^ 2 in(i/^Kn i as required. □ 

To gain some insight into Theorem l3.4l we study the cases for which the empirical 
Rademacher complexity is close to its expectation. In the first setting, jwg is fixed 

and non-zero. Therefore the second term of the theorem is / 2iu(i/g)(”-i) 
tends to zero as soon as the number of users m tends to infinity. Similarly, if 
IwiI = 0(n) when n tends to infinity, the second term of the theorem is ©(l/i/mn) 
when n and m tends to infinity. Finally, when \uji \ is fixed and non-zero, the second 
term grows as 0(\/^) when n and m tend to infinity. In the last setting we 
require m to grow faster than n so that n/m tends to zero to enforce the empirical 
Rademacher complexity to be close to its expectation. 

We can now turn to concrete definitions of the function class Q relating to the 
matrix factorisation scenario we are interested in. Consider a specific form of the 
scoring function of the form Qy ^ {Q : {wi,yp,yq) M> h{ufvp - ufv,) | ||U||f < 
-Ri II Villi’ < i?} which allows us to analyse more deeply the empirical Rademacher 
complexity. Our goal is to bound the empirical Rademacher complexity based on 
the observations S. Before presenting a result we introduce two useful theorems. 
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Theorem 3.5 ([20]). Let ... ,(j)n be functions with respective Lipschitz constants 
7 i,... , 7 „, then the following bound holds: 



n 

sup y]cri(/)i(/(a:i)) 

< 

n 

sup Y] (Ti'yifixi) 




J 


where T is a function class on x G X and tri,..., cr„ are independent Rademacher 
variables. 


Theorem 3.6 f|10]'). Let B G be a matrix with singular values ui,... ,(7^ 

where r is the rank of A, and B = D^BDn be a multiplicative perturbation in 
which Dl and Du are nonsingular matrices. The the following holds on the singular 
values ai,... ,d'r of B: 






<di< (Ti\\D_ 


'L\\2\\Db\\2 


R II 2 


where ||j4||2 


\J ^max (2I ~Af — O’YnaxiX') 


is the spectral norm. 


We also introduce the following lemma whose utility will become apparently in 
the sequential theorem. 


Lemma 3.1. Consider a matrix A G then for a fixed integer k the solutions 

Ug and Vg of 

sup tr{ if" A V) 
s.t. II U\\f = 1 

II^ 1 Ie = i, 

are given by U = -^[Qi ■ ■ ■ Qi\ and V = where Pi and are the 

largest left and right singular vectors of A respectively. The corresponding value of 
the objective is ai, the largest singular value of A. 

Proof We introduce (/)(U,V) = tr(U^AV)-iAi(tr(U^U)-l)-iA 2 (tr(V'^V)-l) 
where Ai and A 2 are Lagrange multipliers. Taking derivatives, one obtains the 
following equations: 

(11) AV = AiU 

(12) A^U = A 2 V, 

where Ai and A 2 are Lagrange multipliers. By premultiplying the first equality by 
and the second by V then taking the trace one can see that Ai = A 2 = A is 
the objective value. By substitution of U = A“^AV into the second equality we 
obtain A^AV = A^V which implies that columns of V must be composed of a 
particular eigenvector of A^A. In a similar manner, one obtains AA^U = A^U. 
Denote the left and right singular values of A as p^,..., p,, and q^,..., q,,, where 
r is the rank of A, with corresponding singular values cri > (T 2 > ... > ctj-) then it 
is clear that U = oilq^ • • • q^], V = a 2 [Pi • • • Pi] and A = cti for some constants ai 
and 02 - To find the scaling factors ai and 02 we can write tr(U^U) = = 1 

and tr(V^V) = a 2 k^ = 1 which implies Oi = 02 = Ij'Jk. □ 


These theorems allow us to make the following statement concerning the empir¬ 
ical Rademacher complexity under a matrix factorisation setting. 
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Theorem 3.7. Consider a sample S = U™ • ■ •, drawn from 

distribution 2?"^ x ••• x . Define scoring functions of the form Qh = {Q : 

iw,,yp,yg) ^ h{ufvp-u[vg) \ t/e V€ R"x'=||C;i|f < < Ry} 

where h is Lipschitz with constant B and k is a fixed integer. Then the following 
bound holds on the empirical Rademacher complexity 

m 

where Yi is a diagonal matrix of the first k singular values of {E — E) with E = 
{diag{Xj)~^X)^, E = {diag{Xj)~^X)^ and X = J — X where diag{-) is a diagonal 
matrix of its vector input. 


Proof. Consider the following 


RAUCf^Q^ = 2 E^ 


sup — 

QgQ m rijn' ^ ^ 

^ ^ 1—1 ^ p—lq—1 


< —K 

m 


m 

2 B. 






sup i/i(ufvi - ufvi) 

IUIIf^RuJIVIIf^Rv"^ 




< 


< 


m 

2BR\jRy 

m 

2BR\jR-y 


sup tr(n'^UV-' E - n'^UV^ E) 

U||f<-Ru,||V||f<Rv 


E, [||(E-E)n'' 


E- Ell 


where IlJ'j = Vi for all i and off-diagonal elements are zero. The second line results 
from an application of Theorem 13.51 The fourth line is a matrix representation of 
the sum in the sup term and in the fifth line we use Lemma l3.II For the final line, 
note that the diagonal elements of 11'^ are selected from {—1, -1-1} and so ||n ‘"||2 = 1 
and an application of Theorem 13.61 shows the singular values do not change, giving 
the required result. □ 


Notice that this bound does not depend on the dimensionality of the factor 
matrices U and V. Here we see the motivation for penalising Optimisation [3] by 
the norm of the weight matrices: doing so keeps the Rademacher complexity low. 
We would also benefit from choosing loss functions whose Lipschitz constant is 
small. Putting the parts together allows us to note the following. 

Corollary 3.1. Consider a sample S = U™ ..., (wi, )} drawn 

from distribution x • • • x DfiP. Using the notations defined above the follow¬ 
ing bound holds on the Rademacher complexity of Qh, with probability greater than 
1 -^, 


R^^^iQh) < 


2BRuRv, 


E- Eli 


\ 


21n(l/<5)(n- 1)2 


y_ 
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We now want to make the connection between the expectation of the AUC and 
the Rademacher complexity. To do so we introduce the following lemma. 

Lemma 3.2. Let Q be a function class mapping from W x to [0,1]. Then with 
probability at least 1 — 5 over all samples S drawn from T), the following holds: 


sup (E-d [Q] - Es [Q ]) < Esr^v [ sup (E-p [Q] - Es [Q])] + 
Qg Q Qg q 




ln(l/<5) 

2m^ 



Proof. We can use a similar proof technique to that of Theorem 1,3.41 to write 


|Es[Q]-%JQ]|< 


n — 1 

m\uJa\\u:a\ ’ 


Noting that the left hand side can be written as |(E[(5] — E^ [Q]) — (E[(5] — Eg[Q])| 
and taking the supremum over Q allows us to bound |supQgQ(E[Q] — Es[(5]) — 
suPq6q(]E[Q] -%JQ])|: 

< I sup (E[Q] - Eg [Q]) - (E[Q] - E^ [Q]) | 

QGQ 

< n — 1 

~ mlwollwol’ 

Define the following function 


nS)= sup{E[Q]-Es[Q]), 

Qee 

then the above bound can be used in conjunction with McDiarmid’s theorem and 
we can say 


Pp(/'(^)-E[/'] >e)<exp 


-2e^ 


where Cab = 


E m ^Ui 2 i ’ 

^ 2=1 2-^j—l ^ij J 

If we equate the right side of the above to 5 we have 


/ ln(l/a) 1 ( Ti-l V 

y ~2mT~ Zji=l \u>i\ y \uii\ J ' 


□ 


Finally we show how the empirical expectations oi Q G Q found on different 
samples are related to the Rademacher complexity in the following lemma. 

Lemma 3.3. Let S and S be sampled from the distribution P and consider a 
sequence of Rademacher variables vi,..., Vm, then the following statement is true 

Eg g sup [Eg[Q] - Es[Q]] < 

’ QgQ 

Proof. We start by showing the following is equivalent to supgg Q[Eg[Q]-Es[g]]: 

^ m ^ rii n- 

^ ^ 7 ^ V ^ ^ yuJipi Ui^iq) yujip'^ yujiq)) 7 

m ^ n^n'■ ^ ^ 

t—l ^ p—l q—l 

since when Vi = —1 for all i the two expressions are clearly the same, and Vi = 1 
swaps the observations for the zth user in S with the corresponding ones in S. Since 
both S and S are sampled from the the same distribution the expectation of the 


Eg g^ sup 
QgQ 
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supremum over these sets is the same. Note also that above term is upper bounded 
by the following: 


sup[]Eg[Q] -Es[Q]] < 

Qee 


sup 

Qee 


^ m ^ rii 'iT'i 

^ ^ J ^ ^ ^ ^ Dujipi IJuiiq 

m ^ n^n^ ^ ^ 
i=l ‘ p=l g=l 


+Es.s^. sup 
QgQ 


21Es.s^ sup 
QgQ 


^ m ^ Ui '>^i 

- Y] - 7 Y Y ViQiWi, yuJip, ycoiq) 

m UiTi - 

2=1 ^ 2p=lg=l 


^ m ^ rii 

'y ^ t 'y ^ ^ ^ I^iQiWi, yuiip, yoiiq) 

m Uim 

i=l * p=l q=\ 


where the second line comes from the symmetry of distribution of the Rademacher 
variables. □ 


We can now bound the expectation of the loss using the Rademacher complexity. 


Theorem 3.8. Let S and S be sampled from the distribution T> and let Q be a 
function class mapping from W x to [0,1]. The with probability at least 1 — S, 
the following holds: 


Ed[Q] <Es[g] + i?'^'^^(Q) 


2m2 

Proof. The result follows directly from Lemmas 13.21 and 13.31 


\ 


ln(l/(5)(n — 1)^ 


E 


□ 


Given this result we can examine our loss functions in terms of the bound on the 
expectation. A key advantage of the bound is it is data-dependent and hence we can 
estimate the model complexity if we replace the Rademacher complexity term with 
the bound on its empirical estimate. We have already discussed when this quantity 
is small as well as a similar analysis for the final term. In practice one finds that 
the bound on the Rademacher term is larger than the last term hence once must 
be focus on this term. When studying the loss functions of Section 12.11 one can see 
that the logistic and sigmoid functions are Lipschitz with constant 1. If we say that 
the norms of and are bounded by D then we have —2D^ < ^i^p^q < . The 

hinge loss is given by |max(0,l — xf^ which implies D = l/\/4 so that the loss 
term is in the range [0,1] and the Lipschitz constant is 3/2. 


4. Related Work 

In this section we will briefly review some works on finding low rank matrix 
factorisations for implicit rating matrices. An early attempt to deal with implicit 
ratings using a squared loss is presented in [T51[2T]. The formulation is an adapta¬ 
tion of the Singular Value Decomposition (SVD), and minimizes 

m n 

™EE^b(urvj - l)2-t A(||U||| + IIVIII), 

i=i j=i 

where C is a matrix of weights for user-item pairs and A is a regularisation param¬ 
eter. In m the user-item values are set to 1 for positive items and lower constants 
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for the rest. A related problem is that of matrix completion HI EH] in which one 
minimises the Frobenius norm difference between real and predicted ratings using 
a trace norm penalisation (the sum of the singular values, denoted || • ||»), 

+ A||Z||,, 

i,j€ui 

where A is a user-defined regularisation parameter and Z is the matrix factorisation. 
Notice that only nonzero elements of X are considered in the cost. A key strength of 
matrix completion is the strong theoretical guarantees on finding unknown entries 
with high accuracy. The disadvantage of both these approaches is that they do not 
take into account the orderings of the items. 

In [28] the authors study AUC maximisation for matrix factorisation in the con¬ 
text of recommendation, the primary motivation for which is a maximum posterior 
estimate for a Bayesian framing of the problem. The connection to AUC max¬ 
imisation is made by replacing the indicator function in its computation with the 
log sigmoid denoted by In( 7 ( 2 ;) = ln(l/(l -|- e~^ j). Solutions are obtained using a 
stochastic gradient descent algorithm based on bootstrap sampling. The particular 
optimisation considered takes the form 

m X 

l0gcr(ufvp - ufv,) - -^||U||| 

i—1 pGoJi q^uJi 



where Au, Av^, Avq are regularisation constants for the user factors, positive items 
and negative items respectively. The first term is a log-sigmoid unnormalised re¬ 
laxation of the AUC criterion and the remaining terms are used for regularisation. 
Note that one optimises over the full list as opposed to prioritising the top few. 
Furthermore, our framework specialises to BPR in logistical loss case and when the 
regularisation parameters above are equivalent. 

Another paper which considers the AUC is [28] however it departs from other 
papers in using an item factor model of the form Z^ = which is 

the score for the fth user and jth item. The disadvantage of the item modelling 
approach is that it does not model users separately. Indeed, the connection between 
the user-item factor approach can be seen by setting 
optimisation, the AUC is approximated by the hinge loss, 

m 

min EEE max(0, 1 - Zjp -I- Zig), 

2=1 pGcUi q£C)i 

and one applies stochastic gradient descent using one user, one positive and one 
negative item chosen at random at each gradient step. As noted by the authors, 
this loss does not prioritise the top of the list and hence they propose another loss 

min^ ^ ^ max(0,1 - Z^p -|- Z,,), 

where <I>(a:) = is a sampled approximation of < 

1 — Zig) i.e. the rank of pth item for ith user. Additionally, the algorithm samples 
ranks of positive items in non-uniform ways in order to prioritise the top of the list. 
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A related approach based on Mean Reciprocal Rank (MRR) [24] increases the 
importance of top fc-ranked items in conjunction with maximising a lower bound on 
the smoothed MRR of the list. In words, the MRR is the average of the reciprocal 
of the rank of the first correct prediction for each user. The authors use a sigmoid 
function to approximate the indicator and after relaxation of a lower bound of the 
reciprocal rank, the following function is maximised 

(lnu(Z,,)+Eln(l-X,fcu(Z,fe-Zy)) ) -^(||U||| + ||V|||), 

ij \ k=l / 

where A is a regularisation constant. The first term in the sum promotes elements 
in the factor model that correspond to relevant items and the second term degrades 
the relevance scores of the irrelevant items relative to relevant item yj . 

5. Computational Experiments 

In this section we analyse various properties of MFAUC empirically in order to 
understand its behaviour and ranking performance. We consider the question of 
how well MFAUC optimises the AUC on the training set under specific loss functions 
and weightings. Another important question is whether the parallel optimisation 
procedure can speed up convergence of the objective. A final consideration is the 
evaluation of our ranking framework and other matrix factorisation methods when 
considering results at the very top of the list for each user. For this comparison 
we benchmark against other user-item matrix factorisation methods including Soft 
Impute [TH] and Weighted Regularised Matrix Factorisation (WRMF, [H]). Notice 
that the choice of the logistic loss function and identity weighting implies a com¬ 
parison to BPR. All experimental code is written in Python with critical sections 
written in Cython and C-I--I-. 

We make use of the following synthetic datasets. The first, Syntheticl, is 
generated in the following manner. Two random matrices U* G ^sooxs -y* g 
R 200 x 8 gj.g constructed such that respectively their columns are orthogonal. We 
then compute the partially observed matrix X^ = I(u^Vj > (5(s, 1 — t)) where 
(5(s, 1 — ti) represents the quantile corresponding to the top ti items. Here ti = .1 so 
that there are on average 20 nonzero elements per row of X, and we additionally add 
an average of 5 random relevant ratings per row to form our final rating matrix X. 
For the second dataset, Synthetic2, we generate U and V in a similar way, however 
this time the probability of observing a rating (relevant/irrelevant) is distributed 
according to the power law for each item/user with exponent 1. We iteratively 
sample observations according to this distribution of users and items, setting Xy 
to be 1 if Zij > E[Z], Z = UV^, otherwise it is zero. We continue this process 
until the density of the matrix is at least .1. The final matrix is of size 573 x 300 
with 17796 non-zeros. 

5.1. ROC Analysis of Losses. First we analyse the maximisation of AUC under 
the loss functions we outlined in Section [Q using the synthetic datasets. The 
MFAUC algorithm is set up with k = 8 using kw = 30 row samples and Ky = 10 
column samples to compute approximate derivatives for and v^. The initial 
values of U and V are set to zero mean Gaussian random values with a standard 
deviation of .1, the regularisation constant A = 0, maximum iterations T = 500 
and item distribution exponent t = 0. The learning rate is a = .05 and we choose 
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(3 S {.5,1.0, 2.0} for sigmoid and logistic losses and p S {.5,1.0, 2.0} for the tanh 
item weighting. To make the problem harder we remove 5 items for each user 
and then train using the remaining items, recording the ROC curve at the end of 
the procedure. This process is repeated 5 times, averaging results. Since we are 
interested in the top items of the list, we consider items until a false positive rate 
of 20% is encountered. For clarity we only include the best ROC curves for logistic, 
sigmoid and tanh-based losses, defined as the largest true positive rate at 20% false 
positives. 




(a) Syntheticl (b) Synthetic2 

Figure 3. Left side of ROC curves for the synthetic datasets using 

different loss functions. 

The ROC plots are presented in Figure [3] and on the whole we see that differences 
are slight for both datasets. On Syntheticl the tanh weighting function is slightly 
preferential at the start of the curve relative to the other losses, followed by hinge, 
sigmoid and then the logistic loss. The square loss provides the worst of the results 
as it penalises only the difference in the scores for positive and negative items, 
however one does gain a speed advantage as outlined in Section [221 We can see 
from the curves on Synthetic2 that this is a more challenging problem. We observe 
the logistic loss providing the fewest errors at the start of the ranking followed 
closely by hinge and then the tanh loss. The sigmoid and square loss curves are 
poor in this case, the latter for the reason we have already mentioned. Performing 
gradient descent over the sigmoid loss function can be challenging for the reasons 
we mentioned in Section o 

5.2. Optimisation Strategy. We now examine the parallel optimisation proposed 
in Section [221 in terms of convergence and timing. The MFAUC algorithm is set up 
with k = 8 using ku = 30 row samples and Ky = 10 column samples to compute 
approximate derivatives for and v^. The initial values of U and V are set to 
random Gaussian values, then we fix learning rate a = .05, regularisation A = .1, 
maximum iterations T = 300, and use the hinge loss. To see how the parallelisation 
affects convergence and timings, we record these values for the parallel optimisation 
with 2, 4, and 8 processes, as well as the standard SGD approach, repeating exper¬ 
iments 5 times to get average quantities for each parameter set. The experiment 
is run on an Intel core i7-3740QM GPU with 8 cores and 16 GB of RAM and we 
record the objective value every 5 iterations. 















20 


AUC OPTIMISATION AND COLLABORATIVE FILTERING 




Figure 4. Plots showing the objective function on the synthetic 
datasets with different optimisation routines. 



12 4 8 

Syntheticl 

Synthetic2 

607.1 309.1 163.6 106.9 

689.2 332.8 201.9 16.3 


Table 1. Timings (in seconds) of the optimisation routines with 
the synthetic datasets by number of processes. 


Figure |4] compares the objective values and Table [T] shows the timings of the 
parallel and non-parallel variants of SGD on both datasets. We observe approximate 
speedups of 3 times with 8 processes whilst converging to approximately the same 
objective values for both datasets. When we look at the objective against the 
iteration number, the parallel SGD converges slower than the non-parallel version 
particularly when using 8 cores. One of the reasons for this decrease in convergence 
rate is that the dataset is split into smaller blocks when more processes are applied. 
Overall however, parallel SGD matches the objective of SGD at a few smaller time 
cost and we naturally expect this improvement factor to increase with higher core 
GPUs. 

5.3. Comparative Ranking Performance. Here we concern ourselves with how 
well the different matrix factorisation methods can rank items in a top-£ recommen¬ 
dation task. From each user, 5 randomly selected relevant elements are removed 
and then a prediction is made for the top £ items. In this case, £ G {1, 3, 5} and the 
average values of the precision, recall and AUC are recorded. 

As an initial step in the learning process we perform model selection on the 
training nonzero elements using 3-fold cross validation. For MFAUC the param¬ 
eters are identical to the setup used above and we select learning rates from 
a G {2“^,..., 2“®} and A G {2°,..., 2“^°}. The initial factor matrices are com¬ 
puted using the randomised SVD. The FI measure is recorded on validation items 
in conjunction with MFAUC and used to pick the best solution. We take a sample 
of 3 validation items from the users to form this validation set. With Soft Impute, 
we use regularisation parameters A G {1.0, .8,... ,0} and the randomised SVD to 
update solutions as proposed in [8] . The regularisation parameters for WRMF are 
chosen from A G {2^, 2“^,..., 2“^^}. To select the best parameters we use the 
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p@l 

p@3 

p@5 

rOl 

r@3 

r@5 

AUC 


Softimpute 


.831 

.696 

.549 

.166 

.417 

.549 

.914 


WRMF 


.893 

.754 

.610 

.179 

.452 

.610 

.924 


MFAUC hinge 


.855 

.759 

.606 

.171 

.455 

.606 

.924 


MFAUC square 


.863 

.755 

.607 

.173 

.453 

.607 

.922 

1 

sH 

MFAUC sigmoid 


.867 

.772 

.626 

.173 

.463 

.626 

.923 

>> 

CO 

MFAUC logistic 


.872 

.762 

.614 

.174 

.457 

.614 

.924 


MFAUC tanh p = 

.5 

.854 

.752 

.602 

.171 

.451 

.602 

.923 


MFAUC tanh p = 

1.0 

.866 

.760 

.615 

.173 

.456 

.615 

.924 


MFAUC tanh p = 

2.0 

.874 

.775 

.618 

.175 

.465 

.618 

.921 


MFAUC tanh p = 

5.0 

.880 

.764 

.604 

.176 

.459 

.604 

.920 


Softimpute 


.225 

.193 

.172 

.045 

.116 

.172 

.766 


WRMF 


.431 

.297 

.241 

.086 

.178 

.241 

.817 


MFAUC hinge 


.428 

.294 

.240 

.086 

.176 

.240 

.796 


MFAUC square 


.422 

.290 

.236 

.084 

.174 

.236 

.801 

CN 

sH 

MFAUC sigmoid 


.416 

.284 

.228 

.083 

.170 

.228 

.795 

>> 

CO 

MFAUC logistic 


.444 

.295 

.238 

.089 

.177 

.238 

.800 


MFAUC tanh p = 

.5 

.415 

.284 

.232 

.083 

.171 

.232 

.791 


MFAUC tanh p = 

1.0 

.420 

.284 

.230 

.084 

.170 

.230 

.790 


MFAUC tanh p = 

2.0 

.395 

.285 

.231 

.079 

.171 

.231 

.797 


MFAUC tanh p = 

5.0 

.362 

.259 

.214 

.072 

.156 

.214 

.776 


Table 2. Test errors on the synthetic datasets. Top represents 
Syntheticl and bottom is Synthetic2. Best results are in bold. 


maximum FI scores on the test items averaged over all folds, fixing k = 8 a,s this is 
the dimension used to generate the data. Once we have found the optimal parame¬ 
ters we train using the training observations and test on the remaining elements to 
get estimates of precision, recall and AUC. The training is repeated 5 times with 
different random seeds and the resulting evaluation metrics are averaged. 

Table [2] shows the performance of the matrix factorisation methods. On both 
datasets, MFAUC and WRMF perform better than Soft Impute and particularly 
on Synthetic2. One explanation is that WRMF and MLAUC do not make as¬ 
sumptions about the distributions of the relevant items like Soft Impute. On the 
harder Synthetic2 dataset WRMF gives the best overall results closely followed 
by logistic and hinge loss MLAUC. 

5.3.1. Real Datasets. Next we consider a set of real world datasets: MovieLens, 
Flixster, Epinions and Book Crossing. For the MovieLens and Flixster datasets, 
ratings are given on scales of 1 to 5 and those greater than 3 are considered relevant 
with the remaining ones set to zero. Epinions ratings are given on the scale 0 to 5 
and Book Crossing ones are given from 0 to 10, and we assign relevance to ratings 
greater than 3 and 4 respectively. Eor all datasets, we remove users with less than 
10 items and items with less than 2 users, repeating this process until convergence 
is reached. Properties about the resulting matrices are shown in Table [31 

The experimental procedure is similar to before except that we select k G 
{32,64,128} in this case and use 2 model selection repetitions. Since the datasets 
are larger than the synthetic ones we perform model selection on a subsample of at 
most 10® ratings from the complete matrices. To form the model selection matrix. 
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Dataset 

users 

items 

nonzeros 

sparsity (%) 

Book Crossing 

9571 

68,517 

640,430 

0.098 

Epinions 

12,663 

38,499 

371,969 

0.076 

Flixster 

43,979 

32,024 

5,147,187 

0.37 

MovieLens 

897 

1281 

54,883 

4.78 


Table 3. Properties of the real datasets 


we pick users sequentially until the desired number of ratings is reached. Any items 
which then have no ratings are removed. The parameters for MFAUC are chosen 
from a G {2^,..., 2“^}, A G {2°,..., 2“®} and Kyy = 15. After the model selection 
step, we use the parallel SGD procedure in conjunction with MLAUC to compute 
the final matrix factorisation. 

Table |4] shows the results on these datasets. It is clear that the non-ranking 
based methods perform reasonably well on the more sparse datasets Epinions and 
Book Crossing relative to MLAUC. We noticed in the model selection stage for 
example, MLAUC would not converge adequately for many parameter sets. Whilst 
this did not negatively impact the AUC, it did effect the items at the very top of 
the list, hence the low precision and recall scores for these datasets. In contrast, we 
see with Flixster and MovieLens that the ranking methods show their advantage 
particularly with the hinge and logistic losses. With MovieLens for example, all of 
the ranking losses improve over WRMF in the precisions at 3 and 5. 

A useful comparison is between the hinge and tank losses since it demonstrates 
the effectiveness of prioritising list elements. The results are mixed: on the Epinions 
and MovieLens datasets we gain an improvement with this prioritisation function, 
but on Flixster results are worse. A difficultly of the approach is that one must 
correctly set p for accurate results. 

6. Conclusion 

Recommendation is a Learning To Rank (LTR) problem, in that for each user 
the system has to rank items according to the relevance for the user. Whilst early 
recommender systems based on matrix factorisation aimed to recover the complete 
matrix of ratings, recent advances focus on optimising scoring losses designed for 
LTR problems. The current paper pushes forward this domain by considering local 
AUC maximisation, which focuses on the ranking of top items. The corresponding 
loss is handled with a smooth surrogate function which is minimised through sto¬ 
chastic gradient descent. Furthermore, use of parallel architectures by a blockwise 
partitioning of the rating matrix in conjunction with stochastic gradient descent 
allows the algorithm to run on datasets with millions of known entries. In addition 
we show that our chosen loss functions are consistent with AUC and gained insight 
into the generalisation of the algorithms using Rademacher Theory. 

From the computational study we can conclude that the proposed parallelisation 
by block optimisation speeds up the convergence whilst keeping the quality of the 
objective relative to single process optimisation. The weighting of observations, 
which gives more importance to items which are more often relevant, can be effective 
for certain datasets. In general, the MFAUC approach is a useful tool when the 
sparsity of the underlying dataset under examination is not too high. 
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p@l 

p@3 

p@5 

r@l 

r@3 

r@5 

AUC 


Softimpute 

.040 

.031 

.027 

.008 

.019 

.027 

.782 

fciO 

WRMF 

.040 

.030 

.026 

.008 

.018 

.026 

.754 

a 

•H 

MFAUC hinge 

.018 

.016 

.014 

.004 

.010 

.014 

.845 

W 

W 

MFAUC sigmoid 

.015 

.013 

.012 

.003 

.008 

.012 

.849 

O 

u 

r "s 

MFAUC logistic 

.020 

.016 

.014 

.004 

.010 

.014 

.836 


MFAUC tanh p = .5 

.015 

.012 

.010 

.003 

.007 

.010 

.797 

O 

O 

MFAUC tanh p = 1.0 

.018 

.015 

.013 

.004 

.009 

.013 

.833 

PQ 

MFAUC tanh p = 2.0 

.018 

.016 

.014 

.004 

.009 

.014 

.846 


Softimpute 

.037 

.030 

.026 

.007 

.018 

.026 

.793 


WRMF 

.041 

.031 

.027 

.008 

.019 

.027 

.771 


MFAUC hinge 

.025 

.020 

.019 

.005 

.012 

.019 

.826 

s 

o 

MFAUC sigmoid 

.025 

.020 

.018 

.005 

.012 

.018 

.819 

•H 

MFAUC logistic 

.034 

.027 

.023 

.007 

.016 

.023 

.853 

•H 

Oh 

MFAUC tanh p = .5 

.027 

.024 

.021 

.005 

.014 

.021 

.824 

U 

MFAUC tanh p = 1.0 

.031 

.026 

.023 

.006 

.016 

.023 

.854 


MFAUC tanh p = 2.0 

.033 

.028 

.024 

.007 

.017 

.024 

.859 


Softimpute 

.157 

.111 

.089 

.031 

.067 

.089 

.926 


WRMF 

.167 

.119 

.096 

.033 

.071 

.096 

.891 

1, 

MFAUC hinge 

.168 

.132 

.112 

.034 

.079 

.112 

.984 

H 

0) 

Jj> 

MFAUC sigmoid 

.121 

.090 

.076 

.024 

.054 

.076 

.980 

w 

X 

MFAUC logistic 

.167 

.126 

.107 

.033 

.076 

.107 

.984 

•H 

1—1 

MFAUC tanh p = .5 

.119 

.090 

.077 

.024 

.054 

.077 

.981 

[Jh 

MFAUC tanh p = 1.0 

.058 

.049 

.043 

.012 

.029 

.043 

.967 


MFAUC tanh p = 2.0 

.069 

.052 

.047 

.014 

.031 

.047 

.969 


Softimpute 

.209 

.165 

.140 

.042 

.099 

.140 

.880 


WRMF 

.225 

.174 

.145 

.045 

.104 

.145 

.891 

W 

MFAUC hinge 

.217 

.183 

.158 

.043 

.110 

.158 

.919 

a 

0) 

MFAUC square 

.211 

.176 

.157 

.042 

.106 

.157 

.925 

1-4 

0) 

MFAUC sigmoid 

.237 

.193 

.165 

.047 

.116 

.165 

.924 

> 

o 

MFAUC logistic 

.241 

.191 

.166 

.048 

.114 

.166 

.926 

s: 

MFAUC tanh p = .5 

.223 

.185 

.161 

.045 

.111 

.161 

.922 


MFAUC tanh p = 1.0 

.226 

.183 

.159 

.045 

.110 

.159 

.923 


MFAUC tanh p = 2.0 

.222 

.180 

.157 

.044 

.108 

.157 

.918 


Table 4. Test errors on the real datasets. 
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